PCA and cluster analysis replication in R

Introduction

This study compares two statistical analyses performed on the same data set with different programs. The original work by Vicencio (2019) was carried out with the JMP statistics program; this replication uses RStudio. The same data will be used and the same graphs reproduced, which makes it possible to compare the two sets of results and identify the advantages and disadvantages of each program.

The work of Vicencio (2019) sought, as a first step toward its final objective, to determine whether different clusters can be distinguished within a single obsidian deposit. Vicencio collected a total of 334 obsidian samples from a 120 km² region, all related to the El Paredón deposit. To achieve this first objective, the author performed a series of PCA and k-means cluster analyses on the 334 samples. A prior geochemical analysis by X-ray fluorescence provided semi-quantitative information, in parts per million (ppm), for ten elements: Mn, Fe, Zn, Ga, Th, Rb, Sr, Y, Zr, and Nb.

In his results, Vicencio identifies two main sub-sources, El Paredón and Tres Cabezas, both visible in the statistical analysis and with GIS.

PCA and cluster analysis will be carried out with the R program to compare the results of the two studies.

First of all, the libraries and the data set must be loaded.

library(curl)
## Using libcurl 7.77.0 with LibreSSL/2.8.3
library(MASS)
library(ggplot2)
library(factoextra)
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
library(car)
## Loading required package: carData

Upload Data

f <- curl("https://raw.githubusercontent.com/gabovicen/gabovicen-data-replication-assignment/main/Replica_R.2.csv")
d <- read.csv(f, header = TRUE, sep = ",", stringsAsFactors = FALSE)
head(d)
##   FRX.Sample         Site  Mn    Fe Zn Ga Th  Rb Sr  Y  Zr Nb Cluster
## 1     PARM1A Sub-source 1 317 11090 60 21 17 168 -1 57 261 44       1
## 2     PARM1B Sub-source 1 341 10745 63 22 16 168  1 53 262 47       1
## 3     PARM1C Sub-source 1 510 11128 70 20 15 162  3 52 279 48       1
## 4     PARM8A Sub-source 1 496 11752 51 19 20 170  3 60 271 49       1
## 5     PARM8B Sub-source 1 538 11118 69 22 19 162  2 63 303 50       1
## 6     PARM8C Sub-source 1 694 12255 77 27 23 164  3 61 296 44       1
summary(d)
##   FRX.Sample            Site                 Mn               Fe       
##  Length:334         Length:334         Min.   :  75.0   Min.   : 5120  
##  Class :character   Class :character   1st Qu.: 336.0   1st Qu.: 8680  
##  Mode  :character   Mode  :character   Median : 381.5   Median :10067  
##                                        Mean   : 389.9   Mean   :10097  
##                                        3rd Qu.: 429.0   3rd Qu.:10980  
##                                        Max.   :1208.0   Max.   :20821  
##        Zn               Ga              Th              Rb       
##  Min.   : 40.00   Min.   :10.00   Min.   : 9.00   Min.   :113.0  
##  1st Qu.: 58.00   1st Qu.:18.00   1st Qu.:15.00   1st Qu.:157.0  
##  Median : 65.00   Median :21.00   Median :17.00   Median :162.0  
##  Mean   : 65.32   Mean   :21.03   Mean   :16.56   Mean   :162.4  
##  3rd Qu.: 72.00   3rd Qu.:23.00   3rd Qu.:18.00   3rd Qu.:168.0  
##  Max.   :107.00   Max.   :36.00   Max.   :24.00   Max.   :196.0  
##        Sr               Y               Zr              Nb       
##  Min.   :-1.000   Min.   :39.00   Min.   :138.0   Min.   :34.00  
##  1st Qu.: 2.000   1st Qu.:47.00   1st Qu.:191.0   1st Qu.:39.00  
##  Median : 3.000   Median :51.00   Median :203.5   Median :42.00  
##  Mean   : 3.596   Mean   :52.42   Mean   :232.1   Mean   :43.04  
##  3rd Qu.: 5.000   3rd Qu.:59.00   3rd Qu.:280.0   3rd Qu.:47.00  
##  Max.   :35.000   Max.   :69.00   Max.   :389.0   Max.   :55.00  
##     Cluster     
##  Min.   :1.000  
##  1st Qu.:1.000  
##  Median :2.000  
##  Mean   :1.539  
##  3rd Qu.:2.000  
##  Max.   :2.000
str(d)
## 'data.frame':    334 obs. of  13 variables:
##  $ FRX.Sample: chr  "PARM1A" "PARM1B" "PARM1C" "PARM8A" ...
##  $ Site      : chr  "Sub-source 1" "Sub-source 1" "Sub-source 1" "Sub-source 1" ...
##  $ Mn        : int  317 341 510 496 538 694 557 574 634 503 ...
##  $ Fe        : int  11090 10745 11128 11752 11118 12255 10498 10498 10958 10023 ...
##  $ Zn        : int  60 63 70 51 69 77 70 77 62 67 ...
##  $ Ga        : int  21 22 20 19 22 27 21 24 17 25 ...
##  $ Th        : int  17 16 15 20 19 23 22 20 24 20 ...
##  $ Rb        : int  168 168 162 170 162 164 157 160 163 149 ...
##  $ Sr        : int  -1 1 3 3 2 3 3 3 3 2 ...
##  $ Y         : int  57 53 52 60 63 61 63 62 61 55 ...
##  $ Zr        : int  261 262 279 271 303 296 274 269 275 254 ...
##  $ Nb        : int  44 47 48 49 50 44 49 48 45 43 ...
##  $ Cluster   : int  1 1 1 1 1 1 1 1 1 1 ...
  1. Then, we can draw a scatter plot matrix of the trace elements to see how the 334 samples are distributed relative to one another.
scatterplotMatrix(~ Fe + Zn + Ga + Th + Rb + Sr + Y + Zr + Nb, data = d, legend = TRUE, smooth = list(method = gamLine), diagonal = TRUE, plot.points = TRUE)

Figure 1. Scatter plot from Vicencio, side B of the thesis, Vicencio 2019

As can be seen from the two figures, the scatter plot is rendered more clearly in the R analysis. The formation of two groups is already apparent in this scatter plot distribution, with the elements Y and Zr being the best representatives of the dispersion.

Plotting trace elements

With the scatter plot matrix, Vicencio (2019) identifies specific elements that reveal a clearer distribution pattern between the two sub-sources: Zn, Sr, Y, Zr, and Nb.

Let’s see how these elements are plotted.

  1. Zinc (Zn)
d$Site <- as.factor(d$Site)  # convert Site to a factor so plot() draws boxplots per sub-source
plot(d$Site, d$Zn, main = "Zn Plot", ylab = "Zn")

  2. Strontium (Sr)
plot(d$Site, d$Sr, main = "Sr Plot", ylab = "Sr")

  3. Yttrium (Y)
plot(d$Site, d$Y, main = "Y Plot", ylab = "Y")

  4. Zirconium (Zr)
plot(d$Site, d$Zr, main = "Zr Plot", ylab = "Zr")

  5. Niobium (Nb)
plot(d$Site, d$Nb, main = "Nb Plot", ylab = "Nb")


Principal Component Analysis in R

Principal component analysis (PCA) allows us to summarize and to visualize the information in a data set containing individuals/observations described by multiple inter-correlated quantitative variables. Each variable could be considered as a different dimension.

Principal component analysis is used to extract the important information from a multivariate data table and to express this information as a small set of new variables called principal components. These new variables correspond to linear combinations of the original variables (Kassambara 2017).

Arguments for prcomp():

  x: a numeric matrix or data frame
  scale: a logical value indicating whether the variables should be scaled to have unit variance before the analysis takes place

Arguments for princomp():

  x: a numeric matrix or data frame
  cor: a logical value; if TRUE, the data will be centered and scaled before the analysis
  scores: a logical value; if TRUE, the coordinates on each principal component are calculated
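To see the practical difference, here is a minimal sketch on a small random matrix (hypothetical data, not the obsidian set): when asked to standardize, both functions work from the correlation matrix, so their loadings agree up to sign.

```r
set.seed(1)
m <- matrix(rnorm(100), ncol = 4)   # 25 observations, 4 variables
p1 <- prcomp(m, scale = TRUE)       # SVD-based, divisor n - 1
p2 <- princomp(m, cor = TRUE)       # eigendecomposition-based, divisor n
# The loadings are the same eigenvectors, possibly with flipped signs
max(abs(abs(unclass(p1$rotation)) - abs(unclass(p2$loadings))))
```

prcomp() is generally preferred for numerical stability, which is why it is used below.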

library("factoextra")
# Note: d[1:12, 3:12] keeps the ten element columns for the first 12
# samples only; d[, 3:12] would run the PCA on all 334 samples.
XRF.elements <- d[1:12, 3:12]
head(d[, 1:12])
##   FRX.Sample         Site  Mn    Fe Zn Ga Th  Rb Sr  Y  Zr Nb
## 1     PARM1A Sub-source 1 317 11090 60 21 17 168 -1 57 261 44
## 2     PARM1B Sub-source 1 341 10745 63 22 16 168  1 53 262 47
## 3     PARM1C Sub-source 1 510 11128 70 20 15 162  3 52 279 48
## 4     PARM8A Sub-source 1 496 11752 51 19 20 170  3 60 271 49
## 5     PARM8B Sub-source 1 538 11118 69 22 19 162  2 63 303 50
## 6     PARM8C Sub-source 1 694 12255 77 27 23 164  3 61 296 44

Compute PCA in R using prcomp()

  1. Compute PCA
res.pca <- prcomp(XRF.elements, scale = TRUE)

  2. Visualize eigenvalues (scree plot), showing the percentage of variance explained by each principal component.

fviz_eig(res.pca)

get_eig(res.pca)
##         eigenvalue variance.percent cumulative.variance.percent
## Dim.1  3.360459958      33.60459958                    33.60460
## Dim.2  2.423465909      24.23465909                    57.83926
## Dim.3  1.417447119      14.17447119                    72.01373
## Dim.4  1.138139078      11.38139078                    83.39512
## Dim.5  0.758386276       7.58386276                    90.97898
## Dim.6  0.422621291       4.22621291                    95.20520
## Dim.7  0.289927357       2.89927357                    98.10447
## Dim.8  0.150518676       1.50518676                    99.60966
## Dim.9  0.033703410       0.33703410                    99.94669
## Dim.10 0.005330927       0.05330927                   100.00000
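As a base-R sanity check (on toy data, not the obsidian set): with scale = TRUE the eigenvalues are the squared standard deviations stored in the result, and for standardized variables they sum to the number of variables, which is how the variance.percent column above is computed.

```r
set.seed(7)
m <- matrix(rnorm(200), ncol = 5)   # 40 observations, 5 variables
p <- prcomp(m, scale = TRUE)
eig <- p$sdev^2                     # eigenvalues, as reported by get_eig()
sum(eig)                            # equals ncol(m) = 5 for standardized data
round(100 * eig / sum(eig), 2)      # the variance.percent column
```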
  3. Graph of individuals. Individuals with a similar profile are grouped together.
fviz_pca_ind(res.pca,
             col.ind = "cos2", # Color by the quality of representation
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
             repel = TRUE     # Avoid text overlapping
             )

  4. Graph of variables. Positively correlated variables point to the same side of the plot; negatively correlated variables point to opposite sides.
fviz_pca_var(res.pca,
             col.var = "contrib", # Color by contributions to the PC
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
             repel = TRUE     # Avoid text overlapping
             )

#  Here we see that most of the elements (variables) point in the same direction.
  5. Biplot of individuals and variables
fviz_pca_biplot(res.pca, repel = TRUE,
                col.var = "#2E9FDF", # Variables color
                col.ind = "#696969"  # Individuals color
                )

Now, suppose we would like to predict the position of new samples within the two PCs.

ind.sup.coord <- predict(res.pca, newdata = XRF.elements)
ind.sup.coord[, 1:3]
##           PC1        PC2        PC3
## 1  -3.1256292 -0.9426716  1.4899726
## 2  -2.7684170 -0.5050967  0.5879585
## 3  -0.8406120 -0.3369916 -0.1347065
## 4  -0.3593823 -2.7819518 -1.0415899
## 5   1.3899399 -1.2688419 -0.1038501
## 6   3.2428451 -0.2015854  2.7367338
## 7   1.4207343  0.1940572 -1.2691978
## 8   1.3857145  0.9610954 -0.1756749
## 9   1.1746540 -1.0463507 -0.9965159
## 10 -0.8018672  2.9618846  0.4690704
## 11 -0.4780815  2.1085518 -1.1980067
## 12 -0.2398986  0.8579007 -0.3641934
# Plot of active individuals
p <- fviz_pca_ind(res.pca, repel = TRUE)
# Add supplementary individuals
fviz_add(p, ind.sup.coord, color ="blue")

# Reproduce predict() by hand: centre and scale the new data with the
# parameters stored in the PCA object
ind.scaled <- scale(XRF.elements, 
                    center = res.pca$center,
                    scale = res.pca$scale)
# Coordinates of the individuals: project each scaled row onto the loadings
coord_func <- function(ind, loadings){
  r <- loadings*ind
  apply(r, 2, sum)
}
pca.loadings <- res.pca$rotation
ind.sup.coord <- t(apply(ind.scaled, 1, coord_func, pca.loadings))
ind.sup.coord[, 1:3]
##           PC1        PC2        PC3
## 1  -3.1256292 -0.9426716  1.4899726
## 2  -2.7684170 -0.5050967  0.5879585
## 3  -0.8406120 -0.3369916 -0.1347065
## 4  -0.3593823 -2.7819518 -1.0415899
## 5   1.3899399 -1.2688419 -0.1038501
## 6   3.2428451 -0.2015854  2.7367338
## 7   1.4207343  0.1940572 -1.2691978
## 8   1.3857145  0.9610954 -0.1756749
## 9   1.1746540 -1.0463507 -0.9965159
## 10 -0.8018672  2.9618846  0.4690704
## 11 -0.4780815  2.1085518 -1.1980067
## 12 -0.2398986  0.8579007 -0.3641934

Figure 2. Principal component analysis with the dispersion of two clusters, influenced by the elements Fe, Sr, Y, and Zr. Graph of the three main components, covering 70% of the elemental variance (Vicencio 2019: Figure 39)

The representation of the principal component analysis in Vicencio's (2019) research was somewhat clearer, owing to how the data were displayed. Although the trace elements with the greatest influence differ somewhat between the two analyses, the way JMP displays the data seems more useful for examining sample distribution and clustering.

CLUSTER ANALYSIS


Cluster analysis uncovers subgroups of observations within a dataset by reducing a large number of observations to a smaller number of clusters. A cluster is a group of observations that are more similar to each other than they are to the observations in other groups (Arora et al.)
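As a minimal illustration of the idea (synthetic two-group data, not the obsidian samples), k-means recovers groups whose members are closer to their own centre than to the other centre:

```r
set.seed(42)
toy <- rbind(matrix(rnorm(60, mean = 0), ncol = 2),   # group A: 30 points
             matrix(rnorm(60, mean = 6), ncol = 2))   # group B: 30 points
km <- kmeans(scale(toy), centers = 2, nstart = 25)
sort(km$size)   # with this wide separation, the two planted groups of 30 are recovered
```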

library(rattle.data)
library(flexclust)
## Loading required package: grid
## Loading required package: lattice
## Loading required package: modeltools
## Loading required package: stats4
## 
## Attaching package: 'modeltools'
## The following object is masked from 'package:car':
## 
##     Predict
library(curl)
f <- curl("https://raw.githubusercontent.com/gabovicen/gabovicen-data-replication-assignment/main/Replica_R.2.csv")
clusterdata <- read.csv(f, header = TRUE, sep = ",", stringsAsFactors = FALSE)
head(clusterdata)
##   FRX.Sample         Site  Mn    Fe Zn Ga Th  Rb Sr  Y  Zr Nb Cluster
## 1     PARM1A Sub-source 1 317 11090 60 21 17 168 -1 57 261 44       1
## 2     PARM1B Sub-source 1 341 10745 63 22 16 168  1 53 262 47       1
## 3     PARM1C Sub-source 1 510 11128 70 20 15 162  3 52 279 48       1
## 4     PARM8A Sub-source 1 496 11752 51 19 20 170  3 60 271 49       1
## 5     PARM8B Sub-source 1 538 11118 69 22 19 162  2 63 303 50       1
## 6     PARM8C Sub-source 1 694 12255 77 27 23 164  3 61 296 44       1
  1. Let's use the ten elements for the cluster analysis.
ds1 <- scale(clusterdata[2:12][-1])  # columns 3:12 = the ten elements, standardized

As a first glimpse, we can plot the quantitative data to see a general pattern for the cluster analysis.

plot(ds1)

plot1 <- function(data = ds1, nc = 6, seed = 123456){ 
  # within-groups sum of squares for a single cluster (i.e. the total variance)
  wss <- (nrow(data)-1)*sum(apply(data, 2, var))
  # k-means for k = 2 .. nc, recording the within-groups sum of squares each time
  for (i in 2:nc){
    set.seed(seed)
    wss[i] <- sum(kmeans(data, centers = i)$withinss)
  }
  plot(1:nc, wss, type = "b", xlab = "Number of Clusters",
       ylab = "Within groups sum of squares")
}

This will help us determine how many clusters we want. Nevertheless, as Vicencio's analysis centers on two, we can use this step simply as a reference for other options.

plot1(ds1)

The graph we have plotted above has a bend at two, which suggests we should use two clusters, as the original analysis suggested.

This graph is especially useful if you know nothing about your data set.
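The elbow logic can also be checked numerically rather than visually; a sketch on synthetic two-cluster data (hypothetical, not the obsidian set):

```r
set.seed(123)
toy <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
             matrix(rnorm(100, mean = 5), ncol = 2))
# within-groups sum of squares for k = 1 .. 6
wss <- sapply(1:6, function(k) kmeans(toy, centers = k, nstart = 10)$tot.withinss)
# The drop from k = 1 to k = 2 is far larger than any later drop:
# that is the "bend" the plot shows
diff(wss)
```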

library(NbClust)
set.seed(1234)
devAskNewPage(ask=TRUE)
nc1<-NbClust(ds1, min.nc=2, max.nc=6, method="kmeans") 

## *** : The Hubert index is a graphical method of determining the number of clusters.
##                 In the plot of Hubert index, we seek a significant knee that corresponds to a 
##                 significant increase of the value of the measure i.e the significant peak in Hubert
##                 index second differences plot. 
## 

## *** : The D index is a graphical method of determining the number of clusters. 
##                 In the plot of D index, we seek a significant knee (the significant peak in Dindex
##                 second differences plot) that corresponds to a significant increase of the value of
##                 the measure. 
##  
## ******************************************************************* 
## * Among all indices:                                                
## * 15 proposed 2 as the best number of clusters 
## * 2 proposed 3 as the best number of clusters 
## * 6 proposed 4 as the best number of clusters 
## * 1 proposed 6 as the best number of clusters 
## 
##                    ***** Conclusion *****                            
##  
## * According to the majority rule, the best number of clusters is  2 
##  
##  
## *******************************************************************

If we want to know the distribution of the 334 samples between the two clusters, we can use kmeans().

set.seed(12)
fit.kmseed<-kmeans(ds1, 2, nstart=25) # performs the k-means cluster analysis
fit.kmseed$size
## [1] 154 180

In this case, 154 samples are assigned to one cluster, while the remaining 180 belong to the second cluster, matching the results shown in Vicencio's research.

Figure 3. K-Means Analysis, side B, Vicencio 2019

fit.kmseed$centers
##            Mn         Fe         Zn         Ga         Th         Rb         Sr
## 1  0.01766557  0.6229535  0.6207665  0.3548979  0.2175302  0.5352394 -0.5906522
## 2 -0.01511388 -0.5329713 -0.5311002 -0.3036349 -0.1861091 -0.4579270  0.5053357
##            Y         Zr         Nb
## 1  0.9660370  0.9696201  0.9355399
## 2 -0.8264983 -0.8295639 -0.8004063
# cross-tabulation of type and cluster membership:
ct.kmseed<-table(clusterdata$Site, fit.kmseed$cluster)
ct.kmseed
##               
##                  1   2
##   Sub-source 1 154   0
##   Sub-source 2   0 180
library(flexclust)
randIndex(ct.kmseed)
## ARI 
##   1
fit.kmseed$cluster # these are each of the clusters
##   [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##  [38] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##  [75] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [112] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [149] 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [186] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [223] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [260] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [297] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [334] 2
library(cluster)
clusplot(ds1, fit.kmseed$cluster, main='2D Representation of the Cluster Solution',
         color=TRUE, shade=TRUE, plotchar = TRUE, labels=2, lines=0)

  2. Now, let's use Zr and Nb, two elements with distinctive distribution properties, as noted by Vicencio (2019).
ds2 <- scale(clusterdata[10:12][-1])  # columns 11:12 = Zr and Nb, standardized

As a first glimpse, we can plot the quantitative data to see a general pattern for the cluster analysis.

plot(ds2)

plot1 <- function(data = ds2, nc=6, seed=123456){ 
  wss <- (nrow(data)-1)*sum(apply(data,2,var))
  for (i in 2:nc){
    set.seed(seed)
    wss[i] <- sum(kmeans(data, centers=i)$withinss)}
  plot(1:nc, wss, type="b", xlab="Number of Clusters",
       ylab="Within groups sum of squares")
  }
plot1(ds2)

library(NbClust)
set.seed(1234)
devAskNewPage(ask=TRUE)
nc1<-NbClust(ds2, min.nc=2, max.nc=6, method="kmeans") 

## *** : The Hubert index is a graphical method of determining the number of clusters.
##                 In the plot of Hubert index, we seek a significant knee that corresponds to a 
##                 significant increase of the value of the measure i.e the significant peak in Hubert
##                 index second differences plot. 
## 

## *** : The D index is a graphical method of determining the number of clusters. 
##                 In the plot of D index, we seek a significant knee (the significant peak in Dindex
##                 second differences plot) that corresponds to a significant increase of the value of
##                 the measure. 
##  
## ******************************************************************* 
## * Among all indices:                                                
## * 15 proposed 2 as the best number of clusters 
## * 1 proposed 3 as the best number of clusters 
## * 8 proposed 4 as the best number of clusters 
## 
##                    ***** Conclusion *****                            
##  
## * According to the majority rule, the best number of clusters is  2 
##  
##  
## *******************************************************************
set.seed(12)
fit.kmseed<-kmeans(ds2, 2, nstart=25) # performs the k-means cluster analysis
fit.kmseed$size
## [1] 154 180

This is the same sample distribution as in ds1 (the ten elements).

fit.kmseed$centers
##           Zr         Nb
## 1  1.0094837  0.9355399
## 2 -0.8636694 -0.8004063
# cross-tabulation of type and cluster membership:
ct.kmseed<-table(clusterdata$Site, fit.kmseed$cluster)
ct.kmseed
##               
##                  1   2
##   Sub-source 1 149   5
##   Sub-source 2   5 175
library(flexclust)
randIndex(ct.kmseed)
##       ARI 
## 0.8834752
fit.kmseed$cluster # these are each of the clusters
##   [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 1 1
##  [38] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##  [75] 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [112] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1
## [149] 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [186] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2
## [223] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [260] 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [297] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [334] 2
library(cluster)
clusplot(ds2, fit.kmseed$cluster, main='2D Representation of the Cluster Solution',
         color=TRUE, shade=TRUE, plotchar = TRUE, labels=2, lines=0)

It is clear that with Zr and Nb the clusters are better defined.

This result is close to what we see in the original work. The use of trace elements with greater influence on the distribution and conglomeration of groups provides a better result.

Figure 4. Bivariate graph of the two main components, with clustering ellipses at 95 percent probability, made from the two trace elements with the greatest variation: Sr and Zr (Vicencio 2019: Figure 44)

  3. As a last exercise, let's see whether it is possible to identify three clusters.
ds3 <- scale(clusterdata[10:12][-1])  # Zr and Nb again, standardized
set.seed(12)
fit.kmseed<-kmeans(ds3, 3, nstart=25) # performs the k-means cluster analysis
fit.kmseed$size
## [1] 178  67  89
fit.kmseed$centers
##           Zr         Nb
## 1 -0.8723587 -0.8104284
## 2  0.7753195  0.4083707
## 3  1.1610499  1.3134318
# cross-tabulation of type and cluster membership:
ct.kmseed<-table(clusterdata$Site, fit.kmseed$cluster)
ct.kmseed
##               
##                  1   2   3
##   Sub-source 1   3  63  88
##   Sub-source 2 175   4   1
library(flexclust)
randIndex(ct.kmseed)
##       ARI 
## 0.7217842
fit.kmseed$cluster # these are each of the clusters
##   [1] 2 2 3 3 3 2 3 3 2 2 2 2 2 3 2 2 3 2 2 3 3 2 3 3 3 3 3 3 2 3 3 3 3 1 1 3 3
##  [38] 3 3 2 3 3 3 2 2 2 3 3 3 3 2 2 2 3 3 3 3 2 2 2 2 2 3 3 2 3 3 3 3 3 2 3 2 3
##  [75] 3 2 2 2 3 3 1 2 2 3 3 3 2 3 3 3 2 3 3 3 2 3 2 3 3 3 3 2 3 2 3 2 2 2 3 3 2
## [112] 3 3 3 2 2 2 3 2 3 3 3 3 3 2 2 2 3 3 2 2 3 2 2 2 3 2 3 2 2 2 3 2 3 2 3 3 3
## [149] 3 2 3 3 3 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [186] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1
## [223] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [260] 1 1 1 1 1 1 1 1 1 1 1 1 1 3 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [297] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [334] 1
library(cluster)
clusplot(ds3, fit.kmseed$cluster, main='2D Representation of the Cluster Solution',
         color=TRUE, shade=TRUE, plotchar = TRUE, labels=2, lines=0)

Even if we try to find three clusters, only two are representative.
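One way to quantify this is the average silhouette width, which peaks at the best-supported number of clusters; a sketch with the cluster package on synthetic two-group data (hypothetical, not the obsidian samples):

```r
library(cluster)   # for silhouette()
set.seed(9)
toy <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
             matrix(rnorm(100, mean = 5), ncol = 2))
# average silhouette width for a k-cluster k-means solution
avg_sil <- function(k) {
  km <- kmeans(toy, centers = k, nstart = 10)
  mean(silhouette(km$cluster, dist(toy))[, "sil_width"])
}
sapply(2:4, avg_sil)   # the k = 2 solution scores highest
```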

Conclusions

As shown in the work by Vicencio (2019), it is possible to identify at least two clusters in the distribution of the 334 geological samples from the El Paredón deposit. I found both statistical programs extremely useful for representing these data, although JMP displays some of the results more clearly. The shortcomings in my R figures reflect my own ability with the program, and I am sure, from what we have seen during class, that R gives us as researchers more control over the statistical analysis of our work.

References

Arora, Aarti, Andrew Mark, Natalie Robinson, and Audrey Tjahjadi (with modifications by Christopher A. Schmitt). Module 25: Cluster Analysis. AN/BI 588: Course Modules. https://fuzzyatelin.github.io/bioanth-stats/module-25/module-25.html

Kassambara, Alboukadel. 2017. Principal Component Analysis in R: prcomp vs princomp. http://www.sthda.com/english/articles/31-principal-component-methods-in-r-practical-guide/118-principal-component-analysis-in-r-prcomp-vs-princomp/#general-methods-for-principal-component-analysis

Vicencio, A. Gabriel. 2019. El Paredón y Tlaxcala: un estudio regional de un yacimiento de obsidiana durante el Formativo Medio y el Formativo Tardío en Tlaxcala. Unpublished Master's thesis, Universidad Nacional Autónoma de México, Mexico City.